Posters - Schedules

Poster presentation times:
Session A: Monday, July 11 and Tuesday, July 12, between 12:30 PM CDT and 2:30 PM CDT
Session B: Wednesday, July 13, between 12:30 PM CDT and 2:30 PM CDT

Session A Poster Set-up and Dismantle
Session A Posters set up: Monday, July 11, between 7:30 AM CDT and 10:00 AM CDT
Session A Posters dismantle: Tuesday, July 12 at 6:00 PM CDT

Session B Poster Set-up and Dismantle
Session B Posters set up: Wednesday, July 13, between 7:30 AM CDT and 10:00 AM CDT
Session B Posters dismantle: Thursday, July 14 at 2:00 PM CDT
Virtual: Annotating and Indexing Scientific Articles with Rare Diseases
COSI: TextMining
  • Hosein Azarbonyad, Elsevier, Netherlands
  • Zubair Afzal, Elsevier, Netherlands
  • Max Dumoulin, Elsevier, Netherlands
  • Rik Iping, Erasmus MC University Medical Center Rotterdam, Netherlands
  • George Tsatsaronis, Elsevier, Netherlands


Presentation Overview:

In Europe, 30 million people suffer from a rare disease. Rare disease patients are entitled to the best possible health care, which makes the efficient organization of the respective clinical care and scientific literature imperative. This requires deep bibliometric analysis, which in turn depends on the efficient annotation and indexing of the respective scientific literature.

With this work, we present a novel methodology to annotate scientific articles with concepts that describe rare diseases from the OrphaNet taxonomy (orphadata.org). The technical challenges are several: first, some rare diseases are rare only in a specific part of the population; second, some rare diseases are conceptually very similar; third, the OrphaNet taxonomy might be incomplete in certain areas; and fourth, polysemy and synonymy of the names of rare diseases may hinder the applicability of any annotation engine. We will discuss how Elsevier has used TERMite, a state-of-the-art annotation engine, to query OrphaNet concepts on Scopus and address some of these challenges, in combination with advanced NLP and Text Mining techniques. We will demonstrate the results of such an analysis in rare diseases research and highlight directions for future research that may address the open challenges.
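
TERMite itself is proprietary, so the minimal sketch below only illustrates the general idea: dictionary matching of OrphaNet-style synonyms with a naive guard against ambiguous acronyms. All concept IDs, synonym entries, and the disambiguation rule are illustrative assumptions, not TERMite's actual behavior.

    # Minimal sketch of dictionary-based rare-disease annotation with
    # synonym handling; entries below are hypothetical examples.
    import re

    # Surface form (lower-cased) -> canonical OrphaNet-style concept ID.
    ORPHA_SYNONYMS = {
        "marfan syndrome": "ORPHA:558",
        "marfan's syndrome": "ORPHA:558",
        "acute intermittent porphyria": "ORPHA:79276",
        "aip": "ORPHA:79276",          # ambiguous acronym; see guard below
    }

    # Acronyms that collide with other biomedical meanings (polysemy).
    AMBIGUOUS = {"aip"}

    def annotate(text):
        """Return (surface form, concept ID, offset) matches found in text."""
        hits = []
        lowered = text.lower()
        for surface, concept in ORPHA_SYNONYMS.items():
            for m in re.finditer(r"\b" + re.escape(surface) + r"\b", lowered):
                # Naive disambiguation: keep an ambiguous acronym only if an
                # unambiguous synonym of the same concept occurs elsewhere.
                if surface in AMBIGUOUS and not any(
                    s != surface and c == concept and s in lowered
                    for s, c in ORPHA_SYNONYMS.items()
                ):
                    continue
                hits.append((text[m.start():m.end()], concept, m.start()))
        return hits

    print(annotate("Acute intermittent porphyria (AIP) is a rare disease."))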

Virtual: CoSMEL - A transformer-based approach to build context-specific metabolic models
COSI: TextMining
  • Sai N, Indian Institute of Technology Madras, India
  • Roshan Balaji, Indian Institute of Technology Madras, India
  • Pavan Kumar S, Indian Institute of Technology Madras, India
  • Nirav Pravinbhai Bhatt, Indian Institute of Technology Madras, India


Presentation Overview:

Genome-scale models can illuminate the molecular basis of a phenotype and provide a mechanistic understanding of it. Since only a subset of reactions is active under different biological conditions, numerous algorithms have been developed to build context-specific models; these require a confidence score for each reaction as input. Previous studies have focused on using transcriptomics of a context to generate these confidence scores. Often, however, a vast amount of literature has been published for a particular context. These literature data can provide alternative or complementary information that avoids the arbitrary thresholding required with transcriptomics. However, the enormous growth of biomedical-literature data makes manual metabolic-network reconstruction for a context tedious. We propose an automated pipeline, CoSMEL (Context-Specific Metabolic model Extraction using Literature data), which couples Named Entity Recognition (NER) with constraint-based model-extraction algorithms to build context-specific models from literature data. CoSMEL uses transformer models to perform NER tagging on biomedical text data. The named entities are then used to extract active reactions and their importance in a context. We demonstrate the ability of CoSMEL to reconstruct the metabolic network of small-intestine enterocytes using PubMed abstracts and a genome-scale model of human metabolism (Recon).
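
A minimal sketch of the NER tagging step, assuming a public biomedical token-classification checkpoint from the Hugging Face hub (the abstract does not name the exact model); mention counts stand in for the reaction-importance scores, and the mapping from entities to reactions is omitted.

    # Tag biomedical entities in abstracts with a transformer NER model,
    # then count mentions per entity as a crude importance score.
    from collections import Counter
    from transformers import pipeline

    ner = pipeline("token-classification",
                   model="d4data/biomedical-ner-all",   # assumed checkpoint
                   aggregation_strategy="simple")

    abstracts = [
        "Enterocytes oxidize glutamine as a major fuel source.",
        "Glucose uptake in the small intestine is mediated by SGLT1.",
    ]

    mention_counts = Counter()
    for text in abstracts:
        for ent in ner(text):
            mention_counts[ent["word"].lower()] += 1

    # Frequently mentioned entities would receive higher confidence scores
    # and feed a constraint-based model-extraction algorithm downstream.
    for entity, count in mention_counts.most_common():
        print(entity, count)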

Virtual: Evaluating the Effectiveness of Back-Translation Data Augmentation on Biomedical Sentence Similarity
COSI: TextMining
  • Aman Patel, Weill Cornell Medicine, United States
  • Mingquan Lin, Weill Cornell Medicine, United States
  • Yifan Peng, Weill Cornell Medicine, United States


Presentation Overview:

Biomedical Semantic Textual Similarity measures the degree of semantic equivalence between two pieces of text. While pre-trained BERT models achieve superior performance to conventional machine learning models, they are prone to overfitting due to inadequate training data. To address this problem, we apply a back-translation method to augment the data. Specifically, we use a pre-trained machine translation model to translate text from English to Chinese and re-translate it back into English. We then fine-tuned the BERT sentence similarity model and evaluated it on the BIOSSES dataset using the Pearson correlation, an evaluation metric that measures the linear correlation between the predicted scores and the gold labels. We repeated the experiments 20 times to obtain a distribution of the Pearson correlation and report the standard deviation. Experiments show that the BERT model with data augmentation was superior to the one fine-tuned on the original data (0.8583 ± 0.0147 vs. 0.5438 ± 0.0668). The results indicate that back-translation data augmentation can boost model performance, showing the effectiveness of the proposed method.
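
A minimal sketch of the back-translation step, assuming the public Helsinki-NLP MarianMT checkpoints on the Hugging Face hub as the translation models (the abstract does not name the exact model used).

    # Round-trip English -> Chinese -> English to create paraphrased
    # training pairs for data augmentation.
    from transformers import pipeline

    en_to_zh = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")
    zh_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")

    def back_translate(sentence):
        """Round-trip a sentence through Chinese to obtain a paraphrase."""
        zh = en_to_zh(sentence)[0]["translation_text"]
        return zh_to_en(zh)[0]["translation_text"]

    original = "The protein inhibits tumor growth in mice."
    augmented = back_translate(original)
    # The augmented sentence keeps the label of the original pair,
    # enlarging the training set for fine-tuning the similarity model.
    print(original, "->", augmented)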

Virtual: Exploring Gene-Disease Associations by Gene Network Analysis on PubMed
COSI: TextMining
  • Can Beslendi, Institute of Medical Informatics, University of Münster, Münster, Germany
  • Michael Fujarski, Institute of Medical Informatics, University of Münster, Münster, Germany
  • Sarah Sandmann, Institute of Medical Informatics, University of Münster, Münster, Germany
  • Julian Varghese, Institute of Medical Informatics, University of Münster, Münster, Germany


Presentation Overview:

Assembling valid and thorough information on genes, characterized by mutations or differential expression, and on their relations to different disease entities is a challenging task. Searching for a gene like “TP53” in PubMed returns around 25,000 results, so extracting the information of interest would require tedious manual screening. Available applications like PolySearch2.0 provide only simple pattern-matching algorithms.
To improve current workflows, we propose a novel approach using Text Mining and Natural Language Processing. The Python module NLTK (Natural Language Toolkit) is used to process text corpora from different databases, e.g. PubMed, fully automatically. Sentences and words are tokenized by the module. Subsequently, genes and conditions are identified using Named Entity Recognition. Finally, sentences matching the input query are determined and analyzed, revealing relations between different genes as well as between genes and disease entities.
To ensure wide dissemination and use of our approach, including by non-bioinformaticians, we developed an intuitive GUI. For given genes or disease entities, all matching database entries are analyzed. Results are efficiently visualized by means of an interactive gene network, and associated publications can be accessed directly. Thereby, our tool helps researchers analyze huge amounts of data within minutes and generate new hypotheses.
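
A minimal sketch of the sentence-level analysis described above, using NLTK for tokenization; the dictionary lookup stands in for the Named Entity Recognition step, and the gene and disease term lists are hypothetical.

    # Tokenize abstracts into sentences with NLTK and count sentence-level
    # co-occurrences of gene and disease terms as network edges.
    from collections import Counter
    from itertools import combinations
    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("punkt_tab", quiet=True)  # needed on newer NLTK versions

    GENES = {"tp53", "brca1"}
    DISEASES = {"breast cancer", "sarcoma"}

    abstracts = [
        "TP53 mutations are frequent in sarcoma. BRCA1 is linked to breast cancer.",
    ]

    edges = Counter()
    for abstract in abstracts:
        for sentence in nltk.sent_tokenize(abstract):
            lowered = sentence.lower()
            found = [t for t in GENES | DISEASES if t in lowered]
            for a, b in combinations(sorted(found), 2):
                edges[(a, b)] += 1   # one edge per co-occurring pair

    print(edges)   # e.g., ('sarcoma', 'tp53') and ('brca1', 'breast cancer')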

Virtual: Statistical design considerations for lexicon-based information extraction
COSI: TextMining
  • Mireya Diaz, Western Michigan University Homer Stryker M.D. School of Medicine, United States


Presentation Overview:

A large proportion of information extraction (IE) via natural language processing (NLP) is based on lexicons. Although IE is considered a low-level task, it is a fundamental step for higher-level NLP tasks, and its performance determines the performance of the overall system. A key statistical design consideration for this task is whether such a lexicon exists or needs to be built. For the latter, it is imperative to know how large the corpus sample needs to be. A short answer to this question is “the more the better”. However, logistics, validation, and resource constraints limit that “infinite” size.
Among the design considerations for developing a useful lexicon are: the size of the lexicon sought, the word count of the documents, the prevalence of non-stop words, the distribution of appearance of unique tokens, and the distribution of word lengths. Categorization of words adds complexity; if words are categorized, one must also consider whether the categories are mutually exclusive or overlap. This work illustrates the pertinent steps to estimate the corpus size and their application in real-world scenarios.
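
As one concrete illustration (an assumption of this sketch, not a method named in the abstract), if vocabulary growth is modeled with Heaps' law, V(n) = K * n**beta, the token count needed to reach a target lexicon size has a closed form; the parameter values below are hypothetical and would normally be fit on a pilot sample of documents.

    # Back-of-the-envelope corpus-size estimate under Heaps' law.
    K, beta = 44.0, 0.49          # hypothetical parameters from a pilot corpus
    target_lexicon_size = 5_000   # unique non-stop-word terms sought

    # Invert V(n) = K * n**beta to get the token count n needed to observe
    # roughly `target_lexicon_size` unique terms.
    tokens_needed = (target_lexicon_size / K) ** (1.0 / beta)

    avg_tokens_per_document = 250  # e.g., abstracts; a design input
    documents_needed = tokens_needed / avg_tokens_per_document
    print(f"~{tokens_needed:,.0f} tokens, ~{documents_needed:,.0f} documents")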

J-001: Creation and expansion of an inclusive and diverse lifestyle factor ontology for biomedical text mining
COSI: TextMining
  • Katerina Nastou, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Yijia Xie, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Esmaeil Nourani, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Danai Vagiaki, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Sampo Pyysalo, TurkuNLP group, Department of Computing, Faculty of Technology, University of Turku, Finland
  • Søren Brunak, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Lars Juhl Jensen, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark


Presentation Overview:

While dictionary-based text mining of genes, diseases, and their interactions is well established, methods for identifying the equally important lifestyle factors are lacking. This is in part because no manually annotated corpus of disease-lifestyle associations exists, and in part because existing ontologies focus on specific aspects of lifestyle and, even when combined, do not capture important disease-associated factors – like socioeconomic status – well.
To address the latter issue, we manually developed a prototype lifestyle factor ontology, which aims to cover the full, diverse range of lifestyle factors. Since introducing bias is unavoidable when manually creating an ontology, we trained a BioBERT model to detect additional terms from the scientific literature that were not present in the current ontology, and added these to the ontology after manual checking. Following this expansion, an extensive manual effort was undertaken to resolve ambiguities with other biomedical dictionaries, generating a “cleaner” version of the ontology. This ontology was then used in combination with a disease ontology to extract co-occurrences of lifestyle factors and diseases in the biomedical literature and to characterize the relationships between these two entities.
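
A minimal sketch of the expansion step described above: terms proposed by the trained tagger are kept as candidates only if they are absent from the current ontology, then queued for manual checking. The tagger function, terms, and ontology entries are all hypothetical placeholders.

    # Filter tagger output against the existing ontology to propose
    # candidate terms for manual review.
    existing_terms = {"smoking", "alcohol consumption", "physical activity"}

    def run_biobert_tagger(text):
        """Placeholder for the trained BioBERT lifestyle-factor tagger."""
        return ["smoking", "shift work", "socioeconomic status"]

    candidates = set()
    for abstract in ["Shift work and smoking are linked to disease risk."]:
        for term in run_biobert_tagger(abstract):
            if term.lower() not in existing_terms:
                candidates.add(term)      # novel term -> manual checking

    print(sorted(candidates))   # ['shift work', 'socioeconomic status']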

J-002: BioRED: A Comprehensive Biomedical Relation Extraction Dataset
COSI: TextMining
  • Ling Luo, NCBI, NLM, NIH, United States
  • Po-Ting Lai, NCBI, NLM, NIH, United States
  • Chih-Hsuan Wei, NCBI, NLM, NIH, United States
  • Cecilia Arighi, University of Delaware, United States
  • Zhiyong Lu, NCBI, NLM, NIH, United States


Presentation Overview:

Automated relation extraction (RE) from biomedical literature is critical for text mining application development in both research and the real world. However, most existing benchmarking datasets for biomedical RE focus only on relations of a single type (e.g., protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we present BioRED, a first-of-its-kind biomedical RE corpus with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene-disease, chemical-chemical), annotated on a set of 600 PubMed articles. Further, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including BERT-based models, on the named entity recognition (NER) and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a comprehensive dataset can successfully facilitate the development of more accurate, efficient, and robust RE systems for biomedicine.
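
For illustration, relation-extraction benchmarking of the kind described above is typically scored by comparing predicted and gold (entity, entity, relation-type) triples; the triples in this sketch are invented examples, not BioRED annotations.

    # Score predicted relation triples against gold annotations with
    # precision, recall, and F-score.
    gold = {("JAK2", "polycythemia vera", "positive_correlation"),
            ("aspirin", "thrombosis", "negative_correlation")}
    pred = {("JAK2", "polycythemia vera", "positive_correlation"),
            ("aspirin", "headache", "negative_correlation")}

    tp = len(gold & pred)                    # correctly extracted triples
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")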

J-003: A deep-learning approach for contextualizing antimicrobial resistance genes
COSI: TextMining
  • Arman Edalatmand, McMaster University, Canada
  • Xue Ji Zhao, McMaster University, Canada
  • Saduni Rajapaksa, McMaster University, Canada
  • Ramkrishna Upadhyaya, McMaster University, Canada
  • Abdalmuhaymen Ibrahim, McMaster University, Canada
  • Andrew G. McArthur, McMaster University, Canada


Presentation Overview:

Antimicrobial outbreak publications outline the key factors involved in an uncontrolled spread of infection, including the environments, pathogens, hosts, and antimicrobial resistance genes at play. Individually, each paper published in this area gives a glimpse into the devastating impact drug resistance has on healthcare, agriculture, and livestock. When examined together, these papers reveal a story across time, from the discovery of new resistance genes to their dissemination to different pathogens, hosts, and environments.

My work aims to extract this information from publications by using the biomedical deep-learning language model BioBERT. BioBERT is pre-trained on all abstracts found in PubMed and has state-of-the-art performance on language tasks over biomedical literature. I trained BioBERT on two tasks: entity recognition, to identify AMR-relevant terms (e.g., AMR genes, taxonomy, environments, geographical locations); and relation extraction, to determine which terms identified through entity recognition contextualize AMR genes. Datasets were generated semi-automatically to train BioBERT for these tasks. My work currently collates results from 204,094 antimicrobial publications worldwide and generates interpretable summaries of the sources where resistance genes are commonly found. Overall, my work takes a large-scale approach to collecting antimicrobial resistance data from a commonly overlooked resource.
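
A hedged sketch of the entity-recognition fine-tuning step, using the public BioBERT checkpoint with the Hugging Face Trainer API; the AMR tag set and the one-sentence toy dataset below are illustrative stand-ins for the semi-automatically generated training data.

    # Fine-tune BioBERT for token classification (NER) on a toy example.
    import torch
    from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    LABELS = ["O", "B-AMR_GENE", "I-AMR_GENE"]   # assumed, simplified tag set

    tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
    model = AutoModelForTokenClassification.from_pretrained(
        "dmis-lab/biobert-base-cased-v1.1", num_labels=len(LABELS))

    # One toy sentence stands in for the semi-automatically labelled corpus;
    # every real token is tagged "O" purely so the example runs end to end.
    enc = tokenizer(["NDM-1 was detected in Klebsiella pneumoniae."],
                    truncation=True, padding=True)
    labels = [[-100 if wid is None else 0 for wid in enc.word_ids(0)]]

    class ToyDataset(torch.utils.data.Dataset):
        def __len__(self):
            return 1
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in enc.items()}
            item["labels"] = torch.tensor(labels[i])
            return item

    trainer = Trainer(model=model,
                      args=TrainingArguments(output_dir="biobert-amr-ner",
                                             num_train_epochs=1,
                                             per_device_train_batch_size=1),
                      train_dataset=ToyDataset())
    trainer.train()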